Data Visualization · FH Potsdam · Summer 2023
In this last instalment of the data visualization tutorials we will be analyzing and visualizing geographic data; i.e., data that refers to geospatial entities. Geospatial entities can, for example, be particular places such as schools and libraries or political boundaries of cities or countries. Of course, this tutorial only scratches the surface. Consider this as a teaser into the world of geovisualization, which in itself has become a branch of research and practice at the intersection of geography and visualization. We will only touch on a few basic steps to get your feet wet and hands dirty.
As you come to expect by now we first assemble our tools and then prepare the data.
For this tutorial we will continue to rely on Altair and Pandas, but add a few packages for geospatial analysis, such as GeoPandas, which will help us to work with DataFrames that contain spatial entities to carry out geometric analysis on them. As before, the additional packages are also listed in the file requirements.txt:
import altair as alt
import pandas as pd
import geopandas as gpd
To access the data of the OpenStreetMap, we will use methods from the handy package OSMPythonTools and we will draw from GeoPy, which will help us turn addresses into geographic coordinates—these two will be imported later in this notebook.
Once we have the tools assembled, we can get started working with geospatial data. There are actually plenty of formats used to record geospatial data, but GeoJSON has become an important standard for exchanging geospatial data on the web. However, please note that GeoPandas can actually load many other vector-based data formats used in digital cartography, such as shapefiles and GeoPackage.
Suppose we would like to get the geographic boundaries of Potsdam's districts, which happen to be published on Potsdam's Open Data Portal. Akin to how we would read a JSON file with Pandas, we can also use read_file() provided by GeoPandas simply by passing the URL to the data set of interest and get a geographic DataFrame back:
districts = gpd.read_file("https://opendata.potsdam.de/explore/dataset/statistische-bezirke-in-potsdam/download/?format=geojson")
If you replace geojson with shp at the end of the above URL, you can also load this data in a shapefile format. Either way the data is going to be loaded into the data structure of a GeoDataFrame. The main difference with a regular Pandas DataFrame is that a GeoDataFrame features a geometry column, which is a geoseries containing the points, paths, and polygons for each row. For example, if each row represents one district, the respective geometries would probably contain the geospatial boundaries…
✏️ Are you curious what the districts DataFrame actually looks like? Take a look at it with the methods you know by now:
Geographically speaking, the districts are defined by their geographic shapes, which are represented as polygons, each of which is a list of tuples of geographic coordinates. Next we add information about schools in Potsdam:
schools = gpd.read_file("https://opendata.potsdam.de/explore/dataset/schulen/download/?format=geojson")
✏️ Have a look at the schools as well, and compare the contents of the geometry columns in schools and districts. Do you notice anything?
OpenStreetMap (OSM) is "a collaborative project to create a free editable map of the world". As such it has millions of contributing users who have been collecting, updating and refining map data for over 15 years, which has generated a vastly comprehensive source of geographic data. It is by no means complete — whatever this would mean — but it is an impressively large geographic database and, of course, a map in itself, too.
To get a list of libraries in Potsdam (according to the OSM), we first need to find the right Potsdam. For this we use the geocoding powers of OSM through the Nominatim search service:
from OSMPythonTools.nominatim import Nominatim
nominatim = Nominatim()
city = nominatim.query('Potsdam, Germany')
city.areaId()
3600062369
OpenStreetMap has its own kind of query language, which is quite compact and can also be a source for errors. To make query formulation easier, you can either use the web interface overpass turbo or the overpassQueryBuilder, which provides access to the main parameters:
from OSMPythonTools.overpass import overpassQueryBuilder
library_query = overpassQueryBuilder(
area=city.areaId(), # the query can be contrained by an area of an item
elementType='node', # which are points (OSM also has ways and relations)
# the selector in the next line is really the heart of the query:
selector='"amenity"="library"', # we're looking for libraries
out='body', # body indicates that we want the data, not just the count
includeGeometry=True # and we want the geometric information, too
)
library_query
'area(3600062369)->.searchArea;(node["amenity"="library"](area.searchArea);); out body geom;'
The output of the above cell is the compact version of the query, which is carried out in the next step:
from OSMPythonTools.overpass import Overpass
overpass = Overpass()
lib_data = overpass.query(library_query)
The variable lib_data now already contains the result from the query against OSM. Let's have a look at it. With elements() we can access the retrieved points. Let's take a look at the first entry:
lib_data.elements()[0].tags()
{'addr:city': 'Potsdam',
'addr:country': 'DE',
'addr:housenumber': '5',
'addr:postcode': '14469',
'addr:street': 'Kiepenheuerallee',
'amenity': 'library',
'contact:email': 'bibliothek@fh-potsdam.de',
'contact:fax': '+49 331 580 2229',
'contact:phone': '+49 331 580 2211',
'contact:website': 'https://www.fh-potsdam.de/informieren/organisation/wiss-einrichtungen/bibliothek/bibliothek-news/',
'name': 'Hochschulbibliothek',
'opening_hours': 'Mo-Fr 09:00-19:00; Sa 09:00-14:30; PH,Su off; 2019 Dec 21-2020 Jan 01 off',
'opening_hours:url': 'https://www.fh-potsdam.de/informieren/organisation/wiss-einrichtungen/bibliothek/wir-ueber-uns/oeffnungszeiten/',
'operator': 'Fachhochschule Potsdam',
'operator:type': 'public',
'operator:wikidata': 'Q896706',
'operator:wikipedia': 'de:Fachhochschule Potsdam',
'ref:isil': 'DE-525',
'wheelchair': 'limited'}
Similarly, we can also access the geometry, which in this case is just a point:
lib_data.elements()[0].geometry()
{"coordinates": [13.051358, 52.41372], "type": "Point"}
Next, we use the compact form of a list comprehension to extract the libraries' names and coordinates:
libraries = [ (lib.tag("name"), lib.geometry() ) for lib in lib_data.elements()]
… which we turn into a GeoDataFrame. By naming the second column geometry we indicate towards GeoPandas to interpret the coordinates as geographic locations:
libraries = gpd.GeoDataFrame(libraries, columns = ['name', 'geometry'])
Let's repeat the process to retrieve Berlin's trees (as recorded by the OSM community):
# 1. prepare query (and directly include the location lookup)
tree_query = overpassQueryBuilder(
area=nominatim.query('Berlin, Germany').areaId(),
elementType='node',
selector='"natural"="tree"',
out='body',
includeGeometry=True
)
# 2. execute query (and give it a bit more time to finish)
tree_data = overpass.query(tree_query, timeout=60)
# 3. get ids and coordinates of trees
tree_locs = [ (tree.id(), tree.geometry()) for tree in tree_data.elements()]
# 4. create GeoDataFrame
trees = gpd.GeoDataFrame(tree_locs, columns=["id", "geometry"])
trees.head()
| id | geometry | |
|---|---|---|
| 0 | 21487172 | POINT (13.35177 52.51431) |
| 1 | 26908663 | POINT (13.34627 52.47173) |
| 2 | 27306554 | POINT (13.30086 52.52252) |
| 3 | 27306733 | POINT (13.30062 52.52235) |
| 4 | 30429119 | POINT (13.30101 52.52255) |
What are you interested in? You may want to consult the Overpass API or play around with overpass turbo.
✏️ Copy the above code into the cell below and adjust the query and variable names according to your interest:
One important processing step is turning addresses into geolocations.
A typical challenge before actually visualizing geographic data is to extract the geographic coordinates of items of interest (may they be trees or libraries). When loading GeoJSON files or querying OpenStreetMap, the geometries are already included. However, oftentimes we may only have street addresses or the names of geographic entities such as cities or points of interests, which cannot be directly used to derive positions on a map. Yet, intuitively speaking, names of places and street addresses are unique ways of identifying locations. To turn such geographic strings into geographic tuples we rely on geocoding. In short: geocoding translates an address or place name into geographic coordinates.
There are plenty of commercial geocoders out there, but for the purpose of this tutorial we simply use OpenStreetMap's Nominatim service via the handy GeoPy package. (We could also stick to the OSMPythonTools that we already used above, but GeoPy has some handy ways of carrying out multiple geocoding steps in a batch.)
# import Nominatim as geopy geocoder
from geopy.geocoders import Nominatim
# register a custom user agent (commercial services may also require an API key)
geocoder = Nominatim(user_agent="Data Visualization Tutorial 2023 · FH Potsdam")
Let's start with an address that students of FH Potsdam might be familiar with:
loc = geocoder.geocode("Kiepenheuerallee 5, 14469 Potsdam")
print(loc)
Campus Fachhochschule Potsdam, 5, Kiepenheuerallee, Bornstedt, Potsdam Nord, Potsdam, Brandenburg, 14469, Deutschland
We can access the geographic coordinates one by one:
print((loc.latitude, loc.longitude))
(52.4123583, 13.050748548573427)
Note that we can also use a name of a place; however, in this case the coordinates do not refer to the street address, but the center of the place:
loc = geocoder.geocode("Fachhochschule Potsdam")
print((loc.latitude, loc.longitude))
(52.4121432, 13.0507812)
✏️ Give the geocoder a try and issue a geocoding request for an address or placename of your choice:
Let's proceed with a dataset containing multiple places. Do you remember the childcare dataset we loaded in the data wrangling tutorial? The dataset actually did not include geospatial coordinates, but street addresses! Let's load the CSV again into a DataFrame:
kitas = pd.read_csv("https://opendata.potsdam.de/explore/dataset/kitaboerse-20161108/download/", sep=";")
The hausnummer column contains some non-numerical items such as months (for no apparent reason). This will cause errors later during geocoding. Below we extract the first number encountered in the hausnummer cells. Some kitas have multiple house numbers, but for the purpose of this tutorial one will suffice.
# the parameter of extract is a regular expression that matches the first set of digits
kitas.hausnummer = kitas.hausnummer.str.extract('(\d+)')
# in some cases the value is not a number, which we will replace with zeros
kitas.hausnummer = kitas.hausnummer.fillna(0)
Geocoding can take a bit of time, so we limit ourselves to just a few random entries here:
kitas = kitas.sample(20)
✏️ If you have a bit of time, increase the above number or simply comment out the line to analyze all kitas in Potsdam
Note that the address information is spread across three columns: strasse, hausnummer, postleitzahl.
We will turn them into one column, with which we query the geocoder. For the purpose of this tutorial, we keep the remaining data at a minimum and thus only keep the names of the kitas and their capacities:
kitas
| name_der_kindertagesbetreuungseinrichtung | stand_vom | barrierefreie_einrichtung | kinderkrippe_0_3_j | tagespflege_0_3_j | padagogisch_begleitete_spielgruppe_0_3_j | kindergarten_3_j_schuleintritt | hort_ab_schuleintritt | andere_kinderbetreuung_ab_3_klasse | strasse | ... | abweichende_offnungszeiten | ubernachtung_moglich | schliesstage_von_bis | fruhstuck | mittag | vesper | abendessen | versorgungsart | link_zum_bereich_kinder_und_jugend_der_landeshauptstadt_potsdam | link_zu_anmeldeinformationen_der_einrichtung | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 13 | Turmspatzen, AWO, Kita | 01.09.2016 | Ja | Ja | Nein | Nein | Ja | Ja | Nein | Kaiser-Friedrich-Str. | ... | NaN | Nein | keine | Ja | Ja | Ja | Nein | Mischversorgung | http://www.potsdam.de/kita-tipp | http://www.awo-potsdam.de/einrichtungen-und-di... |
| 86 | Farbenspiel, Kita | 01.11.2015 | Nein | Ja | Nein | Nein | Ja | Nein | Nein | Peter-Huchel-Str. | ... | NaN | Nein | NaN | Ja | Ja | Ja | Nein | keine Angabe | http://www.potsdam.de/kita-tipp | http://anmeldung.die-kinderwelt.com/ |
| 132 | Spielgruppe Waldstadt | 08.11.2016 | Nein | Nein | Nein | Ja | Nein | Nein | Nein | Ginsterweg | ... | NaN | Nein | NaN | Ja | Ja | Ja | Nein | Eigenversorgung am Standort | http://www.potsdam.de/kita-tipp | http://pbhev.de/?page_id=21 |
| 93 | Hort der evangelischen Grundschule Potsdam | 09.09.2016 | Nein | Nein | Nein | Nein | Nein | Ja | Nein | Große Weinmeisterstr. | ... | Ferien: 8 bis 16 Uhr | Nein | NaN | Nein | Nein | Nein | Nein | keine Angabe | http://www.potsdam.de/kita-tipp | http://www.hoffbauer-bildung.de/grundschule-po... |
| 106 | Nimmerland, Hort | 08.09.2016 | Nein | Nein | Nein | Nein | Nein | Ja | Nein | Karl-Marx-Str. | ... | NaN | Nein | Weihnachtsferien | Nein | Ja | Ja | Nein | Eigenversorgung geliefert | http://www.potsdam.de/kita-tipp | http://www.elternverein-zwergenland.de/wp-cont... |
| 9 | Hort an der Regenbogenschule, Fahrland | 08.11.2016 | Ja | Nein | Nein | Nein | Ja | Ja | Nein | Ketziner Str. | ... | NaN | Nein | zwischen Weihnachten und Silvester, am Freitag... | Nein | Ja | Ja | Nein | Fremdversorgung | http://www.potsdam.de/kita-tipp | http://www.treffpunkt-fahrland.de/index.php?op... |
| 83 | Inselmäuse, AWO, Kita | 01.11.2015 | Nein | Ja | Nein | Nein | Ja | Nein | Nein | Burgstr. | ... | NaN | Nein | Zwischen Weihnachten und Neujahr. Einzelne Sch... | Ja | Ja | Ja | Nein | Eigenversorgung geliefert | http://www.potsdam.de/kita-tipp | http://www.awo-potsdam.de/einrichtungen-und-di... |
| 113 | Havelsprotten, AWO, Hort | 01.11.2015 | Nein | Nein | Nein | Nein | Nein | Ja | Nein | Burgstr. | ... | NaN | Nein | NaN | Nein | Nein | Ja | Nein | Eigenversorgung geliefert | http://www.potsdam.de/kita-tipp | http://www.awo-potsdam.de/einrichtungen-und-di... |
| 91 | Evangelische Kita Hoffkids | 13.09.2016 | Nein | Ja | Nein | Nein | Ja | Nein | Nein | Alt Nowawes | ... | Fr 7.45 bis 16 Uhr | Nein | 25 Tage in den Schulferien | Ja | Ja | Ja | Nein | Fremdversorgung | http://www.potsdam.de/kita-tipp | http://www.hoffbauer-bildung.de/kita-hoffkids/... |
| 37 | Treffpunkt Freizeit, AKi | 01.11.2015 | Nein | Nein | Nein | Nein | Nein | Ja | Ja | Am Neuen Garten | ... | NaN | Nein | NaN | Nein | Nein | Nein | Nein | keine Angabe | http://www.potsdam.de/kita-tipp | http://www.treffpunktfreizeit.de |
| 77 | Kinderclub Einsteinkids, AKi | 21.09.2016 | Nein | Nein | Nein | Nein | Nein | Nein | Ja | Knobelsdorffstr. | ... | NaN | Nein | NaN | Nein | Nein | Nein | Nein | keine Angabe | http://www.potsdam.de/kita-tipp | https://www.ejf.de/kinder-und-jugendhilfe/alle... |
| 66 | first steps Deutsch-Englischer Kindergarten | 01.03.2016 | Nein | Ja | Nein | Nein | Ja | Nein | Nein | Ludwig-Richter-Str. | ... | Bei Bedarf Änderungen in Absprache möglich. | Nein | NaN | Ja | Ja | Ja | Nein | Mischversorgung | http://www.potsdam.de/kita-tipp | http://www.firststeps-kita.de/aufnahme/anmeldu... |
| 127 | Tönemaler, Kita | 01.09.2016 | Nein | Ja | Nein | Nein | Ja | Nein | Nein | David-Gilly-Str. | ... | NaN | Nein | NaN | Ja | Ja | Ja | Nein | Mischversorgung | http://www.potsdam.de/kita-tipp | http://www.gfb-potsdam.de/CMS/index.php?option... |
| 30 | Elterninitiativkita Butzemannhaus e.V. | 01.09.2016 | Nein | Ja | Nein | Nein | Ja | Nein | Nein | Seepromenade | ... | Fr bis 16:30 Uhr | Nein | 06.05.16, Brückentag, 21.12.16- 03.01.2017, We... | Ja | Ja | Ja | Nein | Eigenversorgung am Standort | http://www.potsdam.de/kita-tipp | http://butzemannhaus.grossglienicke.de/index.html |
| 8 | Kinderhaus Fridolin | 01.11.2015 | Nein | Ja | Nein | Nein | Ja | Nein | Nein | Alleestr. | ... | NaN | Nein | 5 Fortbildungstage | Ja | Ja | Ja | Ja | Mischversorgung | http://www.potsdam.de/kita-tipp | www.fidl.de |
| 5 | Oberlin-Kindertagesstätte Eiche | 01.02.2016 | Nein | Ja | Nein | Nein | Ja | Ja | Nein | Kaiser-Friedrich-Str. | ... | Hort: 6.30 bis 7:45 Uhr und 11.30 bis 18 Uhr | Nein | NaN | Ja | Ja | Ja | Nein | Eigenversorgung am Standort | http://www.potsdam.de/kita-tipp | NaN |
| 55 | Kids Company, Kita | 01.11.2015 | Nein | Ja | Nein | Nein | Ja | Nein | Nein | Potsdamer Str. | ... | NaN | Nein | 06.11.2015, 24.12.2015-01.01.2016, 06.05.2016,... | Ja | Ja | Ja | Nein | Eigenversorgung geliefert | http://www.potsdam.de/kita-tipp | http://www.kita-kidscompany.de/index.php/anmel... |
| 135 | Tagespflegepersonen FidL - Frauen in der Leben... | 01.11.2015 | Nein | Nein | Ja | Nein | Nein | Nein | Nein | Alleestr. | ... | Terminvereinbarung per Mail info@fidl.de oder ... | Nein | NaN | Ja | Ja | Ja | Nein | keine Angabe | http://www.potsdam.de/kita-tipp | http://fidl.de/kinderbetreuung/tagespflege |
| 78 | Bergmännchen, Kita | 08.11.2016 | Nein | Ja | Nein | Nein | Ja | Nein | Nein | Charlottenstr. | ... | NaN | Nein | NaN | Ja | Ja | Ja | Nein | keine Angabe | http://www.potsdam.de/kita-tipp | http://www.hoffbauer-bildung.de/kita-bergmaenn... |
| 90 | Firlefanz, Kita | 01.11.2015 | Ja | Ja | Nein | Nein | Ja | Nein | Nein | Nedlitzer Holz | ... | NaN | Nein | 06.11.2015 pädagogischer Tag; 24.12.2015 - 01.... | Ja | Ja | Ja | Nein | Eigenversorgung geliefert | http://www.potsdam.de/kita-tipp | http://www.kita-firlefanz.de/resources/aufnahm... |
20 rows × 49 columns
# names of the childcare places
names = kitas["name_der_kindertagesbetreuungseinrichtung"]
# the capacity of the kitas; which we turn into integers
capac = kitas["platze_unbefristet"].fillna(0).astype(int)
# columns containing address information
addr = ['strasse', 'hausnummer', 'postleitzahl']
# we join the values in the three address-related columns for each row
addresses = kitas[addr].apply(lambda row: ' '.join(row.values.astype(str)), axis=1)
# the dataframe we will use
kitas = pd.DataFrame({'name': names, 'capacity': capac, 'address': addresses})
kitas
| name | capacity | address | |
|---|---|---|---|
| 13 | Turmspatzen, AWO, Kita | 205 | Kaiser-Friedrich-Str. 15 14469 Potsdam |
| 86 | Farbenspiel, Kita | 120 | Peter-Huchel-Str. 1 14469 Potsdam |
| 132 | Spielgruppe Waldstadt | 15 | Ginsterweg 3 14478 Potsdam |
| 93 | Hort der evangelischen Grundschule Potsdam | 250 | Große Weinmeisterstr. 49 14469 Potsdam |
| 106 | Nimmerland, Hort | 30 | Karl-Marx-Str. 72 14482 Potsdam |
| 9 | Hort an der Regenbogenschule, Fahrland | 204 | Ketziner Str. 31 14476 Potsdam |
| 83 | Inselmäuse, AWO, Kita | 63 | Burgstr. 23 14467 Potsdam |
| 113 | Havelsprotten, AWO, Hort | 240 | Burgstr. 23 14467 Potsdam |
| 91 | Evangelische Kita Hoffkids | 23 | Alt Nowawes 94 14482 Potsdam |
| 37 | Treffpunkt Freizeit, AKi | 25 | Am Neuen Garten 64 14469 Potsdam |
| 77 | Kinderclub Einsteinkids, AKi | 29 | Knobelsdorffstr. 7 14471 Potsdam |
| 66 | first steps Deutsch-Englischer Kindergarten | 20 | Ludwig-Richter-Str. 9 14467 Potsdam |
| 127 | Tönemaler, Kita | 84 | David-Gilly-Str. 3 14469 Potsdam |
| 30 | Elterninitiativkita Butzemannhaus e.V. | 60 | Seepromenade 54 14476 Potsdam |
| 8 | Kinderhaus Fridolin | 78 | Alleestr. 11 14469 Potsdam |
| 5 | Oberlin-Kindertagesstätte Eiche | 123 | Kaiser-Friedrich-Str. 106 14469 Potsdam |
| 55 | Kids Company, Kita | 71 | Potsdamer Str. 63 14469 Potsdam |
| 135 | Tagespflegepersonen FidL - Frauen in der Leben... | 90 | Alleestr. 1 14469 Potsdam |
| 78 | Bergmännchen, Kita | 109 | Charlottenstr. 72 14467 Potsdam |
| 90 | Firlefanz, Kita | 54 | Nedlitzer Holz 12 14469 Potsdam |
Now let's turn addresses into geometries! For each cell in the address column, a query against OSM will be triggered; to spread out the load we use a RateLimiter provided by GeoPy:
from geopy.extra.rate_limiter import RateLimiter
# add a delay of one second between each geocoding request
geocode = RateLimiter(geocoder.geocode, min_delay_seconds=1)
Next we invoke the geocoder and apply it to the address column (depending on how many entries are to be geocoded, this can take a while):
# apply geocoding to address column; store responses in location column
kitas['location'] = kitas['address'].apply(geocode)
There might be some kitas that have not location information, i.e., for which the geocoder was not able to identify latitude and longitude. We only keep those rows that have location information, i.e., that are notnull():
kitas = kitas[kitas['location'].notnull()]
After that we use one list comprehension to extract latitudes and longitudes from the locations column, which we will then use to transform the DataFrame into a GeoDataFrame featuring its own geometry column:
# create empty columns for coordinates
kitas.loc[:, "lat"] = None
kitas.loc[:, "lon"] = None
# extract lat and lon from locations via one list comprehensions
kitas.loc[:, ['lat', 'lon']] = [ (loc.latitude, loc.longitude) for loc in kitas['location'] ]
# # create GeoDataFrame, pointing explicitly to lon and lat columns
kitas = gpd.GeoDataFrame(kitas, geometry=gpd.points_from_xy(kitas.lon, kitas.lat))
# # remove superfluous columns that are not needed anymore
kitas = kitas.drop(columns=['location', 'lat', 'lon'])
kitas
| name | capacity | address | geometry | |
|---|---|---|---|---|
| 13 | Turmspatzen, AWO, Kita | 205 | Kaiser-Friedrich-Str. 15 14469 Potsdam | POINT (12.99432 52.40529) |
| 86 | Farbenspiel, Kita | 120 | Peter-Huchel-Str. 1 14469 Potsdam | POINT (13.05312 52.42447) |
| 132 | Spielgruppe Waldstadt | 15 | Ginsterweg 3 14478 Potsdam | POINT (13.09033 52.36659) |
| 93 | Hort der evangelischen Grundschule Potsdam | 250 | Große Weinmeisterstr. 49 14469 Potsdam | POINT (13.06226 52.41541) |
| 106 | Nimmerland, Hort | 30 | Karl-Marx-Str. 72 14482 Potsdam | POINT (13.12186 52.39552) |
| 9 | Hort an der Regenbogenschule, Fahrland | 204 | Ketziner Str. 31 14476 Potsdam | POINT (13.01783 52.46601) |
| 83 | Inselmäuse, AWO, Kita | 63 | Burgstr. 23 14467 Potsdam | POINT (13.06774 52.39741) |
| 113 | Havelsprotten, AWO, Hort | 240 | Burgstr. 23 14467 Potsdam | POINT (13.06774 52.39741) |
| 91 | Evangelische Kita Hoffkids | 23 | Alt Nowawes 94 14482 Potsdam | POINT (13.09017 52.39722) |
| 37 | Treffpunkt Freizeit, AKi | 25 | Am Neuen Garten 64 14469 Potsdam | POINT (13.06460 52.40815) |
| 77 | Kinderclub Einsteinkids, AKi | 29 | Knobelsdorffstr. 7 14471 Potsdam | POINT (13.02383 52.38774) |
| 66 | first steps Deutsch-Englischer Kindergarten | 20 | Ludwig-Richter-Str. 9 14467 Potsdam | POINT (13.07678 52.41135) |
| 127 | Tönemaler, Kita | 84 | David-Gilly-Str. 3 14469 Potsdam | POINT (13.03723 52.41535) |
| 30 | Elterninitiativkita Butzemannhaus e.V. | 60 | Seepromenade 54 14476 Potsdam | POINT (13.10524 52.46002) |
| 8 | Kinderhaus Fridolin | 78 | Alleestr. 11 14469 Potsdam | POINT (13.05998 52.40949) |
| 5 | Oberlin-Kindertagesstätte Eiche | 123 | Kaiser-Friedrich-Str. 106 14469 Potsdam | POINT (12.99447 52.40504) |
| 55 | Kids Company, Kita | 71 | Potsdamer Str. 63 14469 Potsdam | POINT (13.00990 52.42016) |
| 135 | Tagespflegepersonen FidL - Frauen in der Leben... | 90 | Alleestr. 1 14469 Potsdam | POINT (13.05937 52.40886) |
| 78 | Bergmännchen, Kita | 109 | Charlottenstr. 72 14467 Potsdam | POINT (13.06582 52.40118) |
| 90 | Firlefanz, Kita | 54 | Nedlitzer Holz 12 14469 Potsdam | POINT (13.05218 52.42579) |
When we have geospatial data readily available as GeoDataFrames, we can now map them with Altair.
(There are other mapping libraries for Python, such as Folium, that provide additional functionalities. Altair's geovis features are basic, but do provide some variety of techniques and have the benefit of working consistently with the other chart types we covered.)
A simple start is placing locations on a base map and adding a bit of further information via tooltips. Let's do this with Potsdam's schools! First, we can have another look at the attributes:
schools.head(5)
| schulnum_1 | schulnumme | ort | trager | y_e89_rbs | plz | sozialraum | planungsra | status | schulname | schulform | strasse | standort | x_e89_rbs | geometry | |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 105478 | 12 | Potsdam | öffentlich | 5.806710e+06 | 14471 | III | 303 | Aktiv | Gerhart-Hauptmann-Grundschule Potsdam | G | Carl-von-Ossietzky-Straße 37 | Hauptstandort | 366239.9998 | POINT (13.03417 52.39427) |
| 1 | 105491 | 16 | Potsdam | öffentlich | 5.806730e+06 | 14482 | IV | 402 | Aktiv | Grundschule Bruno H. Bürgel | G | Karl-Liebknecht-Straße 29 | Hauptstandort | 370309.9999 | POINT (13.09394 52.39543) |
| 2 | 106770 | 17 | Potsdam | öffentlich | 5.810276e+06 | 14469 | II | 201 | Aktiv | Grundschule Am Jungfernsee | G | Fritz-von-der-Lancken-Straße 2 | Hauptstandort | 367861.5560 | POINT (13.05657 52.42671) |
| 3 | 401250 | 18 | Potsdam | öffentlich | 5.802902e+06 | 14478 | VI | 604 | Aktiv | Fröbelschule Schule mit dem sonderpädagogische... | F | Zum Teufelssee 6 | Hauptstandort | 369969.3860 | POINT (13.09042 52.36096) |
| 4 | 105466 | 20 | Potsdam | öffentlich | 5.803430e+06 | 14480 | V | 502 | Aktiv | Grundschule Am Priesterweg | G | Oskar-Meßter-Straße 4-6 | Hauptstandort | 372949.9999 | POINT (13.13397 52.36640) |
This gives us plenty of aspects to visualize.
We will now create a simple map with markers in the form of an Altair chart consisting of two layers:
districts form the lower layer representing their boundaries and the overall geographic shape of Potsdamschools are the points of interests that are displayed on topWhen putting the two layers together they should actually refer to the same geographic location to make sense. Here the districts and schools both refer to Potsdam. Also note that the order when the charts are added together determines the vertical order: first the basemap and then the markers on top.
# 1. mark_geoshape transparently uses the geometry column
basemap = alt.Chart(districts).mark_geoshape(
# add some styling to reduce the salience of the basemap
fill="lightgray", stroke="darkgray"
).properties(width=600, height=600)
# 2. we use mark_circle to have more control over the visual variables
markers = alt.Chart(schools).mark_circle(opacity=1).encode(
# point latitude & longitude to coordinates in the geometry column
longitude='geometry.coordinates[0]:Q',
latitude='geometry.coordinates[1]:Q',
tooltip=['schulname', 'strasse', 'trager'],
)
# combine the two layers
basemap + markers
✏️ How about changing the color of the dots according to trager or schulform?
This technique adjusts the visual features of markers to encode quantitative data dimensions. For example, we can use varying sizes of circles to represent the capacities of Potsdam's kitas. We use the same two-layer structure we used above:
basemap = alt.Chart(districts).mark_geoshape(
fill="lightgray", stroke="darkgray"
).properties(width=600, height=600)
markers = alt.Chart(kitas).mark_circle(opacity=1).encode(
longitude='geometry.coordinates[0]:Q',
latitude='geometry.coordinates[1]:Q',
tooltip=['name:N', 'address:N', 'capacity:Q'],
size="capacity:Q"
)
basemap + markers
When we are dealing with thousands of elements, we are reaching perceptual and technical limitations. One way to mitigate the technical limitations is to take a sample of the elements, large enough to see overall patterns. This is what we are now doing with Berlin's trees, of which there are far too many to display individually, however, a sample might still be informative:
# by default Altair handles a max number of 5000 items only
# the following line disables this limitation; read more here:
# https://altair-viz.github.io/user_guide/large_datasets.html
alt.data_transformers.disable_max_rows()
# to not overburden altair, we're passing a sample of 10k trees
treemap = alt.Chart(trees.sample(n=10000)).mark_circle(
# reduce the visual presence of each element
size=5,
# with a low dot opacity we can use overplotting to indicate densities
opacity=.25,
# a natural choice
color="green"
).encode(
longitude='geometry.coordinates[0]:Q',
latitude='geometry.coordinates[1]:Q'
).properties(width=600, height=600)
treemap
✏️ Add a base layer underneath with Berlin's districts; the Technologiestiftung Berlin offers spatial data in various shapes and sizes including for district boundaries. Hint: you might have to flip the winding order of the geometries.
Finally, let's create the geovisualization that uses the fill color of geospatial shapes to encode quantitative data. To illustrate this, we will visualize the population densities around the world. We will use area and population information from GeoNames and get the geographic shapes of countries from DataHub.
# load country data from geonames
geonames = pd.read_csv("https://www.geonames.org/countryInfoCSV", sep='\t')
# select four columns
geonames = geonames[['name', 'iso alpha3', 'areaInSqKm', 'population']]
# set index to country code
geonames = geonames.set_index("iso alpha3")
geonames.head()
| name | areaInSqKm | population | |
|---|---|---|---|
| iso alpha3 | |||
| AND | Andorra | 468.0 | 77006 |
| ARE | United Arab Emirates | 82880.0 | 9630959 |
| AFG | Afghanistan | 647500.0 | 37172386 |
| ATG | Antigua and Barbuda | 443.0 | 96286 |
| AIA | Anguilla | 102.0 | 13254 |
Next we collect the geographic boundaries and simplify them a bit, as they have more detail than what we need here:
# load country's polygons from datahub
polygons = gpd.read_file("https://datahub.io/core/geo-countries/r/countries.geojson")
# remove country names, as we have them already
polygons = polygons.drop(columns=["ADMIN"])
# set index to country code
polygons = polygons.set_index("ISO_A3")
# reduce the complexity of the shapes
polygons.geometry = polygons.geometry.simplify(.1)
polygons.info()
<class 'geopandas.geodataframe.GeoDataFrame'> Index: 255 entries, ABW to ZWE Data columns (total 1 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 geometry 255 non-null geometry dtypes: geometry(1) memory usage: 12.1+ KB
As both DataFrames use the three-letter country codes as indices, we can join them like this:
# inner means that we keep only those countries
# for which we have geometric and attribute data
countries = polygons.join(geonames, how='inner')
countries.info()
<class 'geopandas.geodataframe.GeoDataFrame'> Index: 238 entries, ABW to ZWE Data columns (total 4 columns): # Column Non-Null Count Dtype --- ------ -------------- ----- 0 geometry 238 non-null geometry 1 name 238 non-null object 2 areaInSqKm 238 non-null float64 3 population 238 non-null int64 dtypes: float64(1), geometry(1), int64(1), object(1) memory usage: 17.4+ KB
Visualizing area or population in a choropleth map, arguably, makes little sense. So let's compute population densities:
countries["density"] = countries["population"] / countries["areaInSqKm"]
Keep only those countries with valid density value and turn these densities into integers:
countries = countries[countries['density'].notna()]
countries.density = countries.density.round(0).astype(int)
There is one ‘country’ that is not really one, which is Antarctica. We'll remove this from the list here.
countries = countries.drop("ATA")
One last manipulation: Polygons are drawn in a specific "winding order" (clockwise or counter-clockwise) in order to distinguish between polygons that go outwards or inwards. Altair doesn't follow the widely used conventions, so we need to turn everything around.
from shapely.ops import orient
countries.geometry = countries.geometry.apply(orient, args=(-1,))
Finally, we draw the chart using Altair's mark_geoshape() method. The distribution of densities is highly skewed, due to very small countries with relatively high population numbers, such as Monaco. To spread out the low and high density values we use a logarithmic scale and set the domain between 1 and 1000. Note that the domain has to end in a multiple of the base, which is by default 10.
alt.Chart(countries).mark_geoshape().encode(
color=alt.Color('density', scale=alt.Scale(type="log", domain=[1,1000] )),
tooltip=['name', 'areaInSqKm', 'population', 'density']
).project(
# enter different projection here
).properties(
width=750,
height=500
)
The map is shown in the default Mercator projection, which particularly distorts the area sizes of North America, Europe and Russia in contrast to Africa, Southern Asia and parts of South America.
✏️ Change the projection used above to one that does not distort area sizes as much (see this list for options).